Mini Project 3: Visualizing and Maintaining the Green Canopy of NYC

📚Introduction

Many New Yorkers do not appreciate the trees that benefit them and their environment every day. More than 1 million trees (1,093,439, to be exact) are spread across the Big Apple, yet litter is scattered around most of them. These trees are essential: they absorb CO2, shelter birds and squirrels, and provide shade for the people who walk beneath them.

While this project is not meant to start a “stop litter” movement, it analyzes trees by City Council district to build a proposal for the NYC Parks Department. Specifically, the goal is to propose a new program and to explain, using visualizations built from official NYC data websites, why action must be taken on the trees of one specific district.

Setting up code libraries
#Libraries used for this project.

#Obtaining data and performing SQL-like commands
library(sf)
library(tidyverse)
library(httr2)

#Data ingestion
library(glue)
library(readxl)
library(tidycensus)

#Display datatables
library(DT)

#Visualization library
library(ggplot2)
library(plotly)

library(tidyr)

💽Download NYC City Council District Boundaries

Data was collected from the NYC Department of City Planning using the latest release available when this project was made, 25C. The shoreline version is used, as it can display more trees compared to the water area version.

Downloading the Boundary Data
#The following code is adapted from how we ingested data in mp02

#Create directory, if it does not exist already, to store data
if(!dir.exists(file.path("data", "mp03"))){
    dir.create(file.path("data", "mp03"), showWarnings=FALSE, recursive=TRUE)
}

library <- function(pkg){
    ## Mask base::library() to automatically install packages if needed
    ## Masking is important here so downlit picks up packages and links
    ## to documentation
    pkg <- as.character(substitute(pkg))
    options(repos = c(CRAN = "https://cloud.r-project.org"))
    if(!require(pkg, character.only=TRUE, quietly=TRUE)) install.packages(pkg)
    stopifnot(require(pkg, character.only=TRUE, quietly=TRUE))
}

#Define zip file name to indicate whether it will exist
zip_name <- "nycc_25c.zip"

url_path <- "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/city-council/nycc_25c.zip"

#Zip file path
zip_path <- "./data/mp03/"

#Download the required file into the correct directory
#zip_path already ends in "/", so the file name is appended directly
if(!file.exists(paste0(zip_path, zip_name))){
  download.file(url = url_path, destfile = paste0(zip_path, zip_name), mode = "wb")
}

unzipped_pathname <- paste0(zip_path, "nycc_25c/")

#Unzip file if necessary
if(!dir.exists(unzipped_pathname)){
  unzip(paste0(zip_path, zip_name), exdir = zip_path, overwrite = TRUE)    #paste0 builds the full path to the zip file
}


#Read shp file and store it as the data variable
DATA <- sf::st_read(paste0(unzipped_pathname, "nycc.shp"))


#Transform result into WGS 84
DATA <- st_transform(DATA, crs="WGS84")
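The download-if-missing pattern used above can be sketched as a small reusable helper. This is only an illustration: `fetch_once`, `stub_fetch`, and the example.com URL are hypothetical names introduced here, and the demonstration uses a stand-in downloader so that it runs without network access.

```r
# Hypothetical helper (not part of the original project): download a file only
# when no local copy exists, and return the local path either way.
fetch_once <- function(url, dest_dir, fetch = utils::download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(url))
  if (!file.exists(dest)) {
    fetch(url, destfile = dest, mode = "wb")
  }
  dest
}

# Offline demonstration with a stand-in downloader that just writes a stub file
stub_fetch <- function(url, destfile, mode) writeLines("stub", destfile)
demo_dir <- file.path(tempdir(), "fetch_once_demo")
first  <- fetch_once("https://example.com/files/demo.zip", demo_dir, fetch = stub_fetch)
second <- fetch_once("https://example.com/files/demo.zip", demo_dir, fetch = stub_fetch)
identical(first, second)  # both calls resolve to the same cached path
```

Building the path with `file.path()` avoids having to track whether a directory string already ends in a slash.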
Raw District Boundary Data Output
#Returning transformed DATA to user
datatable(DATA, style = "bootstrap5", caption = "Raw Data Output")
Explaining the Table

Note: column names were left untouched to show raw data. It may be difficult to understand at first glance.

The datatable may look intimidating, but it provides important information that is used later on. The most notable columns are Shape_Leng, the length of a district's boundary (its perimeter), and Shape_Area, the area the district covers. There are currently 51 districts to work with.

Data Made Easier

The visualization below makes it much easier to see the area being studied. It shows the five boroughs of New York City, with each outlined region representing one City Council district.

Show the code
#Visualization of area being worked on
ggplot() +
  geom_sf(data = DATA, mapping = aes(geometry = geometry)) +
         theme_bw()

Show the code
rm(all)

💽Download NYC Tree Points

Since this project focuses on trees, data containing tree location is used as a main metric. The code below downloads the necessary data.

Downloading the Tree Data
#The following code is a modified version of data acquisition from https://michael-weylandt.com/STA9750/archive/AY-2024-SPRING/miniprojects/mini01.html

if(!file.exists("data/mp03/nyc_tree_locations.csv")){
    
    #URL was modified as per instructions
    ENDPOINT <- "https://data.cityofnewyork.us/resource/hn5i-inap.geojson"
    
    BATCH_SIZE <- 50000   #Edit if we start to see long computations for visuals. Same with offset.
    OFFSET     <- 0
    END_OF_EXPORT <- FALSE
    ALL_DATA <- list()
    
    while(!END_OF_EXPORT){
        cat("Requesting items", OFFSET, "to", BATCH_SIZE + OFFSET, "\n")
        
        req <- request(ENDPOINT) |>
                  req_url_query(`$limit`  = BATCH_SIZE, 
                                `$offset` = OFFSET)
        
        resp <- req_perform(req)
        
        batch_data <- st_read(resp_body_string(resp))
        # batch_data <- fromJSON(resp_body_string(resp))
        
        ALL_DATA <- c(ALL_DATA, list(batch_data))
        
        if(NROW(batch_data) != BATCH_SIZE){
            END_OF_EXPORT <- TRUE
            
            cat("End of Data Export Reached\n")
        } else {
            OFFSET <- OFFSET + BATCH_SIZE
        }
    }
    
    ALL_DATA <- bind_rows(ALL_DATA)
    
    cat("Data export complete:", NROW(ALL_DATA), "rows and", NCOL(ALL_DATA), "columns.")

    write_csv(ALL_DATA, "data/mp03/nyc_tree_locations.csv")
}
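The stopping rule in the loop above (keep requesting pages until a page comes back smaller than the batch size) can be sketched without any network access. `fetch_all`, `fake_page`, and `fake_source` are hypothetical names used only for this illustration; the real loop calls the NYC Open Data endpoint instead of a local data frame.

```r
# Collect pages until one comes back with fewer rows than batch_size
fetch_all <- function(fetch_page, batch_size) {
  offset <- 0
  pages <- list()
  repeat {
    page <- fetch_page(limit = batch_size, offset = offset)
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < batch_size) break   # a short page means the export is done
    offset <- offset + batch_size
  }
  do.call(rbind, pages)
}

# Stand-in for the remote API: a 23-row table served in slices
fake_source <- data.frame(id = 1:23)
fake_page <- function(limit, offset) {
  idx <- seq_len(nrow(fake_source))
  fake_source[idx > offset & idx <= offset + limit, , drop = FALSE]
}

result <- fetch_all(fake_page, batch_size = 10)
nrow(result)
```

When the total row count is an exact multiple of the batch size, this rule issues one extra request that returns zero rows; the sketch handles that because binding an empty data frame adds nothing.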

🗺Mapping️️ NYC Trees

Now that the necessary data has been collected, a visualization will be made to display:

  • Density of trees in a district
  • Exact locations of trees
  • Health of each tree

The visualization will serve as a starting point for deciding which area(s) should be addressed and why.

Creating graph
#Read in data from the files that were downloaded.
boundaries <- st_read('./data/mp03/nycc_25c')
tree_data <- read.csv('./data/mp03/nyc_tree_locations.csv', stringsAsFactors = FALSE) |>
  filter(!is.na(tpcondition), !is.na(geometry)) |>
  #Rename column to be easier to understand on interactive visualization
  rename("Condition" = tpcondition)

# Parse the "c(lon, lat)" string
tree_data_parsed <- tree_data |>
  mutate(coord_str = trimws(gsub("c\\(|\\)", "", geometry))) |>  # Remove "c(" and ")"
  separate_wider_delim(coord_str, delim = ",", names = c("x", "y"), too_few = "align_start") |>
  mutate(
    x = as.numeric(x),
    y = as.numeric(y)
  )

# Create sfc geometry
tree_data$geometry <- st_as_sfc(paste0("POINT(", tree_data_parsed$x, " ", tree_data_parsed$y, ")"))

# Convert to sf
tree_data <- st_as_sf(tree_data)
st_crs(tree_data) <- 4326

#Joining the boundary and tree data
all_data <- st_transform(tree_data, st_crs(boundaries))
all_data <- st_join(all_data, boundaries)
all_data_small <- all_data |>
  slice_head(n = 30000)   #Subset of trees drawn as points on the map

#Count trees per district
tree_counts <- all_data |>
  group_by(CounDist) |>
  summarise(tree_count = n(), .groups = 'drop')

#Add findings to boundaries dataset
boundaries <- boundaries |>
  st_join(tree_counts)

#Store plot in variable to make it interactive in the next code block
tree_plot <- ggplot() +
  geom_sf(data = boundaries, mapping = aes(geometry = geometry, fill = tree_count)) +
  scale_fill_gradient(low = "#F0FFF0", high = "#084511", name = "Tree Count") +
  geom_sf(data = all_data_small, mapping = aes(geometry = geometry, color = Condition), alpha = 0.5, size = 0.3) +
  scale_color_discrete() +
  labs(color = "Condition",
       title = "Street Trees in NYC by City Council District",
       subtitle = "Points represent trees, shade shows tree count per district") +
  #Enlarge the legend keys so the condition colors are readable
  guides(color = guide_legend(override.aes = list(size = 3))) +
  theme_bw()
tree_plot
Show the code
#Make plot interactive using plotly
ggplotly(tree_plot)
Notes on the Visualization

Note: The graph plots only the first 30,000 trees as points due to hardware limitations. The statements below reflect only this visualization and could change if all trees were plotted.

Of the five boroughs, Staten Island has the darkest shading, meaning the most trees per district, yet many of the trees plotted there have an unknown or dead status. The Bronx shows a large number of trees rated in excellent condition, possibly because it is far from JFK airport and sits at the edge of the city. Manhattan also shows many trees north of its southernmost district, which could reflect a past effort to plant more trees or trees planted as decoration to attract tourists. This is an interactive graph, so explore other areas to find different results!
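Because the map uses the first 30,000 rows, the picture depends on the order of the file. A minimal sketch of the difference between taking the first rows and taking a random sample is shown below, using made-up data in base R; with dplyr, `slice_sample(n = 30000)` would be the analogous call. Whether the real file is ordered by location is an assumption that would need to be checked.

```r
set.seed(9750)  # arbitrary seed so the sketch is reproducible

# Made-up table in which rows are grouped by borough, as an ordered export might be
toy_trees <- data.frame(
  tree_id = 1:1000,
  borough = rep(c("Borough A", "Borough B"), each = 500)
)

first_rows  <- toy_trees[1:100, ]                          # like slice_head()
random_rows <- toy_trees[sample(nrow(toy_trees), 100), ]   # like slice_sample()

table(first_rows$borough)    # only one borough is represented
table(random_rows$borough)   # both boroughs are represented
```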

🌲District-Level Analyses of Trees

With the tree points and district boundaries now joined in one data table, more analysis can be done beyond looking at the visualization. For instance, it is much easier to determine which district has the most trees directly from the data, without second-guessing an answer read off the map.

Note that all trees will be included in the following analyses.
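As a small illustration of how a spatial join attaches a district to each tree, the sketch below builds two toy square "districts" and four toy "trees" with sf and counts the trees per district. All coordinates and IDs are invented, and no CRS is set because the toy squares do not correspond to real locations.

```r
library(sf)

# Helper for a unit square whose lower-left corner is at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}

# Two toy districts that do not touch each other
districts <- st_sf(
  CounDist = c(1L, 2L),
  geometry = st_sfc(unit_square(0), unit_square(2))
)

# Four toy trees: two in district 1, one in district 2, one outside both
trees <- st_sf(
  tree_id = 1:4,
  geometry = st_sfc(
    st_point(c(0.2, 0.5)), st_point(c(0.8, 0.1)),
    st_point(c(2.5, 0.5)), st_point(c(5, 5))
  )
)

# Left spatial join: each tree gets the attributes of the district containing it
joined <- st_join(trees, districts)
table(joined$CounDist, useNA = "ifany")
```

Trees that fall outside every polygon keep an `NA` district, which is why the analyses filter on `!is.na(CounDist)` in places.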

Show the code
#Remove datasets that repeat tree data. Also remove redundant values
rm(tree_data, tree_data_parsed, unzipped_pathname, url_path, DATA, zip_name, zip_path, ALL_DATA)

Finding District with Most Trees

District with most trees
#Find the district with the most trees
tree_counts <- all_data |>
  group_by(CounDist) |>
  summarise(tree_count = n(), .groups = 'drop') |>
  mutate(
  Borough = case_when(
    CounDist >= 1  & CounDist <= 10 ~ "Manhattan",
    CounDist >= 11 & CounDist <= 18 ~ "Bronx",
    CounDist >= 19 & CounDist <= 32 ~ "Queens",
    CounDist >= 33 & CounDist <= 48 ~ "Brooklyn",
    CounDist >= 49 & CounDist <= 51 ~ "Staten Island",
    TRUE ~ NA_character_
  )) |>
  arrange(desc(tree_count))

#Create a format_titles variable to make the table columns look nicer. Used in later chunks
#Credit: Professor Michael Weylandt
library(stringr)
format_titles <- function(df){
    colnames(df) <- str_replace_all(colnames(df), "_", " ") |> str_to_title()
    df
}

tree_counts |>
  st_drop_geometry() |>
  slice_head(n=10) |>
  select(CounDist, Borough, tree_count) |>
  format_titles() |>
  rename("Council District" = Coundist) |>
  datatable(style = "bootstrap5", caption = "Top 10 Districts With The Most Trees")
Findings

Council District 51 in Staten Island has the most trees, with 70,965 recorded. Staten Island's other districts also rank 2nd and 6th, which is notable because the borough has only 3 districts; this possibly indicates that the borough is tree dense.

Many Queens council districts also appear, suggesting there is a good chance of seeing trees in whichever Queens neighborhood one enters.

District with Highest Tree Density

Show the code
#Use the Shape_Area column to act as the density maker per district
density_trees <- all_data |>
  st_drop_geometry() |>
  group_by(CounDist) |>
    summarise(
    Shape_Area = first(Shape_Area),  # or sum()/mean() if appropriate
    .groups = "drop"
  ) |>
  left_join(
    tree_counts |>
      st_drop_geometry() |>
      select(CounDist, tree_count, Borough) |>
      distinct(CounDist, .keep_all = TRUE),  # Remove duplicate CounDist rows
    by = "CounDist"
  ) |>
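  mutate(
    #Dividing by 1e6 gives sq km only if Shape_Area is in square meters;
    #confirm the units with st_crs(boundaries) before interpreting the density
    area_sqkm = as.numeric(Shape_Area) / 1e6,
    tree_density = tree_count / area_sqkm
  ) |>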
  arrange(desc(tree_density)) |>
  drop_na() |>
  select(CounDist, Borough, tree_count, area_sqkm, tree_density)
  

density_trees |>
  format_titles() |>
  rename("Council District" = Coundist) |>
  rename("Area (sqkm)" = "Area Sqkm") |>
  rename("Tree Density (per sqkm)" = "Tree Density") |>
  datatable(style = "bootstrap5", caption = "Districts Ranked by Tree Density") |>
  formatRound(c("Area (sqkm)", "Tree Density (per sqkm)"), digits = 3)
Findings

Council District 7 in Manhattan has the highest tree density, with 283.549 trees per sqkm recorded. With a tree count of about 15,000, Council District 7 is the 4th smallest district in New York City, so those trees are packed into a small area. By comparison, District 50 in Staten Island, the largest district, has a tree density of about 78 trees per sqkm, likely because of its size.

Manhattan excels in density in general: the borough packs as much as it can into a small area, and its districts appear 5 times in the top 10 of this ranking. That same pressure on space could be one reason Manhattan districts rank so high in this category.
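The density figures depend on the units of Shape_Area. If the boundary file is in a State Plane projection measured in US survey feet (EPSG:2263 is one such projection), areas would be in square feet rather than square meters. A minimal sketch of the conversion under that assumption is shown below; `sqft_to_sqkm` and the toy numbers are hypothetical and are not taken from the data.

```r
# One US survey foot in meters (defined as 1200/3937 m)
US_SURVEY_FOOT_M <- 1200 / 3937

# Hypothetical helper: convert square US survey feet to square kilometers
sqft_to_sqkm <- function(area_sqft) {
  area_sqft * US_SURVEY_FOOT_M^2 / 1e6
}

# Toy input: one million square feet is a little under a tenth of a square kilometer
toy_area_sqft <- 1e6
toy_area_sqkm <- sqft_to_sqkm(toy_area_sqft)

# Density for a made-up district with 500 trees on that area
toy_density <- 500 / toy_area_sqkm
toy_density
```

Checking `st_crs(boundaries)` shows which length unit the file uses, and therefore which conversion, if any, applies.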

District with the Highest Fraction of Dead Trees

Show the code
#Calculating statistics for dead trees
dead_trees <- all_data |>
  st_drop_geometry() |>
  filter(!is.na(Condition), !is.na(CounDist)) |>
  group_by(CounDist) |>
  summarize(total_trees = n(),
            total_dead_trees = sum(Condition == 'Dead', na.rm = TRUE),
            fraction_dead_trees = total_dead_trees / total_trees * 100,   #Stored as a percentage
            .groups = 'drop') |>
    left_join(
    tree_counts |>
      st_drop_geometry() |>
      select(CounDist, Borough) |>
      distinct(CounDist, .keep_all = TRUE),
    by = "CounDist"
  ) |>
  select(CounDist, Borough, total_trees, total_dead_trees, fraction_dead_trees) |>
  arrange(desc(fraction_dead_trees))

dead_trees |>
    rename("Council District" = CounDist) |>
    format_titles() |>
    rename("Fraction Dead Trees %" = "Fraction Dead Trees") |>
    datatable(style = "bootstrap5", caption = "Dead Tree Data") |>
    formatRound("Fraction Dead Trees %", digits = 3)
Findings

Council District 32 in Queens has the highest percentage of dead trees, with about 14.255% of its trees being dead. One possible reason is that Queens does not receive the attention Manhattan does; combined with being a very large borough, this means more maintenance is required. District 32 also lands in the top 10 districts by tree count, so fixing those trees will take a great deal of work.

Interestingly, no Brooklyn district appears in this category, suggesting that Brooklyn either has fewer trees than Queens or maintains them more effectively.
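The percentage calculation behind this table can be sketched on a tiny made-up data set in base R: for each district, take the share of rows whose condition is "Dead" and multiply by 100. The districts and conditions below are invented for illustration.

```r
# Made-up trees: district 1 has one dead tree out of four, district 2 has two out of two
toy <- data.frame(
  CounDist  = c(1, 1, 1, 1, 2, 2),
  Condition = c("Dead", "Good", "Good", "Fair", "Dead", "Dead")
)

# Split the TRUE/FALSE "is dead" flags by district, then average each group
pct_dead <- vapply(
  split(toy$Condition == "Dead", toy$CounDist),
  function(is_dead) 100 * mean(is_dead),
  numeric(1)
)

# Districts ordered from highest to lowest percentage of dead trees
sort(pct_dead, decreasing = TRUE)
```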

Finding the Most Common Tree Species in Manhattan

Show the code
manhattan_species <- all_data |>
  st_drop_geometry() |>
  left_join(
    tree_counts |>
      st_drop_geometry() |>
      select(CounDist, Borough) |>
      distinct(CounDist, .keep_all = TRUE),
    by = "CounDist"
  ) |>
  filter(Borough == "Manhattan") |>
  group_by(genusspecies) |>
  summarise(count = n(), .groups = 'drop') |>
  arrange(desc(count))

manhattan_species |>
  head(50) |>
  rename("Species" = genusspecies) |>
  format_titles() |>
  datatable(style = "bootstrap5", caption = "Most Common Manhattan Trees")
Findings

The most common tree species in Manhattan is the thornless honeylocust, with 17,310 appearances. It is far ahead of the next most common species, the London planetree, which has about 6,000 fewer appearances. Counts then drop quickly to four digits and then three, suggesting the thornless honeylocust may live longer or adapt better to Manhattan's built environment than other species. More statistics would be needed to verify such a claim.

Tree Species Closest to Baruch College

Show the code
#Create point of Baruch College using longitude and latitude
baruch_point <- st_point(c(-73.98376, 40.74019)) |>     #Point is approximated
  st_sfc(crs = 4326)

#EPSG:2263 is NAD83 / New York Long Island (ftUS); its coordinates are in US survey feet
baruch_point <- st_transform(baruch_point, 2263)

# Create a buffer around Baruch College (also reduces RAM usage).
# dist is read in CRS units, so 1000 means 1000 US survey feet (about 305 m) here;
# a 1 km radius would need dist = 1000 * 3937 / 1200
buffer_1000ft <- st_buffer(baruch_point, dist = 1000)

# Filter all_data to only those within the buffer
all_data_near <- st_filter(all_data, buffer_1000ft, .predicate = st_intersects)

#Finding the tree species closest to Baruch College
baruch_species <- all_data_near |>
  filter(!is.na(genusspecies), !is.na(CounDist)) |>
  left_join(
    tree_counts |>
      st_drop_geometry() |>
      select(CounDist, Borough) |>
      distinct(CounDist, .keep_all = TRUE),
    by = "CounDist"
  ) |>
  #st_distance() returns values in CRS units (US survey feet for EPSG:2263)
  mutate(distance_to_baruch = as.numeric(st_distance(geometry, baruch_point))) |>
  arrange(distance_to_baruch) |>
  head(50) |>
  select(genusspecies, distance_to_baruch, riskrating)

baruch_species |>
  st_drop_geometry() |>
  rename("Species" = genusspecies) |>
  format_titles() |>
  datatable(style = "bootstrap5", caption = "Tree Species Closest to Baruch (1,000 ft radius)") |>
  formatRound("Distance To Baruch", digits = 4)
Findings

The closest tree to Baruch College is a Quercus acutissima (sawtooth oak) at a distance of 54.3122. Because the data is in EPSG:2263, this distance is in US survey feet, or about 16.6 meters. No Risk Rating is recorded for this tree, which suggests that no hazard has been flagged for it.

There also appeared to be a trend between Risk Rating and distance from Baruch College within the search radius: trees farther away were more likely to have a risk rating present. This could indicate that institutions such as Baruch College tend to maintain nearby trees better than areas with no college around.
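Since EPSG:2263 measures lengths in US survey feet, any distance given in meters has to be converted before it is passed to `st_buffer()`, and any distance returned by `st_distance()` has to be converted back. A minimal sketch of both directions is shown below; `m_to_ftus` and `ftus_to_m` are hypothetical helper names.

```r
# One US survey foot in meters (defined as 1200/3937 m)
US_SURVEY_FOOT_M <- 1200 / 3937

# Hypothetical helpers for converting between meters and US survey feet
m_to_ftus <- function(m)  m / US_SURVEY_FOOT_M
ftus_to_m <- function(ft) ft * US_SURVEY_FOOT_M

# Buffer distance, in feet, that corresponds to a 1 km radius
radius_ft <- m_to_ftus(1000)
radius_ft

# Length, in meters, covered by a buffer distance of 1000 feet
ftus_to_m(1000)
```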

🪵NYC Parks Proposal

Project Description

Walking through an area full of dead trees can feel like walking through a barren wasteland. That is what it feels like to enter the Rockaways in District 32. These dead trees must be repurposed in ways that benefit both humans and wildlife, and new trees must be planted to grow and prosper in their place. Therefore, I propose to remove at least 4,000 dead trees, remove all stumps, and plant at least 3,500 new trees to renovate District 32 using as little of the budget as possible.

District Map

The following visualization provides information on where trees are located in our district.

Show the code
#Filter for only District 32 (all_data already carries the district columns, so no extra join is needed)
dist32 <- all_data |>
  filter(CounDist == 32)

boundary_32 <- boundaries |> 
  filter(CounDist.x == 32)

#Create the visualization
ggplot() +
  geom_sf(data = boundary_32, mapping = aes(geometry = geometry)) +
  #Draw dead trees fully opaque and fade every other condition.
  #Condition is a character column, so levels() would return NULL; a TRUE/FALSE flag is mapped instead
  geom_sf(data = dist32, mapping = aes(geometry = geometry, color = Condition, alpha = Condition == "Dead"), size = 1) +
  scale_alpha_manual(values = c("TRUE" = 1, "FALSE" = 0.3), guide = "none") +
  labs(color = "Condition",
       title = "City Council District 32") +
  guides(color = guide_legend(override.aes = list(size = 3))) +
  theme_bw() +
  theme(aspect.ratio = 0.90)

At first glance, most trees are in excellent or good condition, so why worry? The Rockaways at the bottom of the map have healthy trees, but tree condition degrades moving upward. Ozone Park at the top has many trees in the good category, while the neighborhood on the right has trees that are critical or dead. Even the trees between the top and bottom areas have lost their excellent condition.

Let's also think about the geography here. The Rockaways are exposed to more severe weather than the rest of NYC because the area is so closely surrounded by water. Debris from dead trees can knock out power to households or damage structures. It also poses a constant threat to people, from splinters to debris thrown into one’s face by the wind. If this is not convincing enough, the numbers might change your mind.

Quantitative Comparison

District 32 had the highest percentage of dead trees in NYC, calculated to be about 14.3%, with 4,315 dead trees standing there doing nothing. These statistics will be compared with districts 30, 42, and 46, as they are all adjacent to District 32. District 30 is an outlier for not being next to water, but it can represent how trees perform in a different kind of area.

The bar graph below compares the percentage of dead trees in these districts.

Show the code
#Compares percent of dead trees across selected districts
dist_comparison <- dead_trees |>
  st_drop_geometry() |>
  filter(CounDist %in% c(32, 30, 42, 46)) |>
  left_join(
    density_trees |>
      st_drop_geometry() |>
      select(CounDist, area_sqkm, tree_density) |>
      distinct(CounDist, .keep_all = TRUE),
             by = "CounDist")

ggplot(dist_comparison, aes(x=reorder(factor(CounDist), -fraction_dead_trees), y=fraction_dead_trees, fill = CounDist == 32)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(fraction_dead_trees, 2), "%")),
          hjust = -0.1, size = 3) +
  coord_flip() +
  scale_fill_manual(values = c("FALSE" = "lightgray", "TRUE" = "#e95420"), guide = "none") +
  labs(
    title = "Percentage of Dead Trees Across Districts",
    subtitle = "District 32 is highlighted as the target",
    x = "Council District",
    y = "Percentage of Dead Trees"
  ) +
  ylim(0, 15) +
  theme_bw()

Clearly, District 32 has the highest percentage of dead trees among the selected districts. Possible reasons the other districts fare better include better maintenance by the community and conditions in which trees thrive more easily because they are not surrounded by water. District 30 stands out for coming very close to District 32's dead tree percentage, but District 30 has a far greater tree density and count: 136.4 trees per square kilometer, compared with 84.4 in District 32. District 32 likely has fewer trees because of Rockaway Beach, and it does not get the attention it deserves to fix its dead tree issue.
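As a rough consistency check on the figures quoted in this proposal, the dead tree count and the dead tree percentage for District 32 together imply a total tree count, and the proposed removals can be expressed as a share of the dead trees. The sketch below uses only numbers stated in this report; because the percentage is rounded, the implied total is approximate.

```r
dead_trees_d32 <- 4315     # dead trees reported for District 32
pct_dead_d32   <- 14.255   # percentage of dead trees reported for District 32
proposed_removals <- 4000  # minimum number of dead trees the proposal addresses

# Total trees implied by count / percentage
implied_total <- dead_trees_d32 / (pct_dead_d32 / 100)
round(implied_total)

# Share of the district's dead trees covered by the proposal, in percent
share_addressed <- 100 * proposed_removals / dead_trees_d32
round(share_addressed, 1)
```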

Conclusion

If you are still not convinced, take a look at these districts with only their dead trees shown:

Show the code
#Keep District 32 and the three comparison districts
boundary_32 <- boundaries |> 
  filter(CounDist.x %in% c(32, 30, 42, 46))

#Keep only the dead trees in those districts
dist32 <- all_data |>
  filter(CounDist %in% c(32, 30, 42, 46), Condition == "Dead")

#Create the visualization
ggplot() +
  geom_sf(data = boundary_32, mapping = aes(geometry = geometry)) +
  geom_sf(data = dist32, mapping = aes(geometry = geometry), size = 0.7, color = "#e1cb7e", alpha = 0.5) +
  labs(title = "Dead Trees in City Council Districts 30, 32, 42, and 46") +
  theme_bw() +
  theme(aspect.ratio = 0.90)

Council District 32 should now stand out as having the highest dead tree percentage. Districts 42 and 46, which are also close to water, have many dead trees too, but not as densely packed as in District 32. Keep in mind that District 32 is very close to JFK airport; not having enough trees in excellent condition could negatively impact residents, as less CO2 is absorbed by the trees. We can start small by targeting the most critical neighborhoods and work our way toward replacing the dead trees and stumps with new trees in the future.